Abstract


It is common to control access to critical information based on the need-to-know
principle: requests for access are authorized only if the content of the requested
information is relevant to the requester’s project. We formulate this dichotomous
decision in a machine learning framework. Although the cost of misclassifying examples
should be differentiated according to their importance, the best-performing
error-minimizing classifiers have no way of incorporating cost information into their
learning processes. To handle cost effectively, we apply two cost-sensitive
learning methods to the problem of confidential access control and compare their
usefulness with that of error-minimizing classifiers. We also devise a new metric for
assigning costs to arbitrary datasets. Comparing the cost-sensitive classifiers with
the error-minimizing classifiers, we find that costing performs best: it minimizes both
the misclassification cost and the false positive rate while using a relatively small
amount of training data.
1  Introduction


Securing  information  from  unauthorized  accesses  is  very  important  in  an  information- 
rich  society.  For  example,  project  managers  want  to  protect  their  trade  secrets  from 
employees  in  other  departments  as  well  as  outsiders.  For  the  purpose  of  indexing  and 
security,  confidential  information  is  grouped  into  containers  based  on  the  similarity  of 
their  contents  or  similar  levels  of  confidentiality.  A  secure  repository  (e.g.,  a  secured 
database)  holds  all  these  containers  encompassed  by  a  limited  access  system.  Requests 
to  access  the  confidential  information  may  occur,  for  example,  when  an  employee  is 
assigned  to  a  new  project  and  needs  to  access  background  knowledge.  A  set  of  access 
control  lists  (ACL)  might  be  compiled  manually  to  control  those  requests.  Each  item 
of  confidential  information  is  associated  with  an  ACL,  which  ensures  a  corresponding 
level  of  security  and  can  be  accessed  by  anyone  who  has  been  authorized.  However  this 
approach  has  a  crucial  security  weakness  in  that  a  user  who  is  authorized  to  a  segment 
of  confidential  information  in  a  container  is  actually  able  to  access  the  entire  container. 
For  example,  an  employee,  who  is  authorized  only  to  look  at  a  progress  report  on  the 
development  of  new  technology,  is  able  to  access  the  information  about  a  financial  plan 
for  that  project;  the  two  pieces  of  information  are  about  the  same  project  and  hence  are 
held  in  the  same  container.  Therefore  the  supervisor  of  the  collection  of  confidential 
information  will  either  hand  select  only  those  documents  that  he  will  let  the  user  see, 
or  completely  bar  access  to  the  entire  collection  rather  than  risk  exposing  documents 
that  should  not  be  exposed. 

Furthermore, this approach is inflexible. It does not allow easy adjustment to frequent
changes of a user’s task assignment. Project assignments for an employee may
be changed quite often, and hence the employee needs to access confidential information
related to the newly assigned project. In addition, access to a previously assigned
project may need to be revoked. In order to ensure that authorized access is granted and
unauthorized access is denied, the ACLs for all information associated with the project
must be updated according to the rights and the permissions of employees assigned to
the project.

As a solution to these problems, we developed a multi-agent system that handles
the authorization of requests for confidential information as a binary classification
problem [13]. Instead of relying on coarse-grained ACLs and handpicked information,
our system compares the content of the requested confidential information with the
content of the requester’s project and authorizes the request only if the two are relevant.
Whether a request is accepted or rejected, the event can be logged for security audits
and alarms. In this way, our system gives the supervisor a means of specifying
subsets of per-user and per-task access control policies and a way to enforce them
automatically. Since the proposed system learns the supervisor’s decision criteria from
a small number of supervisor-provided examples, the supervisor need not identify
all relevant information. Through our proposed system, it then becomes possible for
the supervisor to define, assign, and enforce a security policy for a particular subset of
confidential information.

Although our approach showed relatively good performance [13], we believe
there is room for improvement. Previously we made use of five different
error-minimizing classifiers for authorizing requests to access confidential information.
We believe that we can improve our results by taking into consideration the cost caused
by misclassification. In particular, an error-minimizing classification method, which
treats all misclassification costs equally, is undesirable for this scenario because it
essentially classifies every example as belonging to the most probable class. For
example, suppose that there are 100 medical records comprising 5 cancer records and
95 cold records. Without considering the cost of misclassification (i.e., misdiagnosis),
an error-minimizing classifier could achieve a low error rate simply by ignoring the
minority class, even though the consequence of misdiagnosing cancer is far worse than
that of misdiagnosing a cold (i.e., its cost is much higher).

In this paper we test the effectiveness of cost-sensitive learning for
the problem of confidential access control. Section 2 compares cost-sensitive
classification with error-minimizing classification in terms of the optimal decision
boundary and details two approaches for cost-sensitive learning. Section 3 describes
four different classification methods as candidates for the process of confidential
access control. Section 4 describes the experimental settings and the empirical
evaluation of cost-sensitive learners. Section 5 presents related work, and Section 6
presents conclusions and future work.


2  Cost-Sensitive  Classification 

In the previous section, we mentioned briefly the reason why a cost-sensitive classifier
is better suited for the problem of confidential access control than an error-minimizing
classifier. This section formalizes the principle in terms of the optimal decision
boundary for a binary classification task with univariate data.

2.1  An  Illustrative  Example 

A classification method is a decision rule that assigns one (or more than one) of a set
of predefined classes to given examples. Some methods produce a continuous output
whereas others produce a discrete class label. The optimal decision boundary is the
decision criterion that allows a classifier to produce the best performance.

Let us consider the hypothetical example in Figure 1, which shows two classes with
overlapping boundaries due to their intrinsic randomness: their actual values are
random variables. In this example, the probability density for each class is normal, that is,
$p(\text{class} = 0 \mid x) \sim N(\mu_0, \sigma_0^2)$ and
$p(\text{class} = 1 \mid x) \sim N(\mu_1, \sigma_1^2)$ (footnote 1).

If the costs of misclassification are equal, where is the optimal decision boundary
($x_e^*$) for a binary classification? Assuming that we know how the two probability
densities are distributed, it is relatively easy to compute the optimal decision boundary.
Formally, let $P$ be the probability of class 1 given an example $x$. At the boundary,

\[
\begin{aligned}
P(x \mid \text{class} = 1) &= P(x \mid \text{class} = 0) \\
P &= 1 - P \\
P &= 0.5
\end{aligned}
\]

Footnote 1: $\mu_0 = 0.3500$, $\sigma_0 = 0.1448$, $\mu_1 = 0.7000$, $\sigma_1 = 0.1736$.

Figure 1: The optimal decision boundary for a binary classification will be determined
by considering the misclassification cost.


The probability of an example $x$ belonging to class 1 is 0.5, meaning that the optimal
decision boundary lies in the center of the two class distributions (i.e., $x_e^* = 0.52$).
Therefore a randomly generated example will be assigned to class 1 if its value is greater
than 0.52. The solid line represents this optimal decision boundary, assuming the costs
of misclassification are equal.
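The value $x_e^* = 0.52$ can be checked numerically. The following is a minimal sketch, assuming equal class priors (so that equal posteriors coincide with equal class densities) and using the parameters from footnote 1; the function names are illustrative and not part of the original system.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def equal_density_boundary(mu0, sigma0, mu1, sigma1, tol=1e-10):
    """Bisection between the two means for the point where both densities are equal.

    Assumes mu0 < mu1, so class 0 dominates at mu0 and class 1 dominates at mu1.
    """
    lo, hi = mu0, mu1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        diff = gaussian_pdf(mid, mu1, sigma1) - gaussian_pdf(mid, mu0, sigma0)
        if diff < 0:      # class 0 is still more likely: move right
            lo = mid
        else:             # class 1 is more likely: move left
            hi = mid
    return 0.5 * (lo + hi)

# Parameters from footnote 1.
boundary = equal_density_boundary(0.35, 0.1448, 0.70, 0.1736)
print(round(boundary, 2))
```

Under these assumptions the crossing point rounds to the 0.52 quoted in the text.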

Relative to the optimal boundary, a classifier has four possible classification
outcomes. “a” is a “true positive”: an example $x$ belongs to “class 1” and is
classified as “class 1.” “c” is a “false negative”: $x$ belongs to “class 1” but is
classified as “class 0.” “d” is a “true negative”: $x$ belongs to “class 0” and is
classified as “class 0.” Finally, “b” is a “false positive”: $x$ belongs to “class 0”
but is classified as “class 1” [10]. Table 1 captures this information as well as the
cost ($\lambda_{ij}$) involved in those four outcomes. In particular, $\lambda_{ij}$ is
the cost of classifying an example belonging to class $j$ as class $i$.

It is reasonable to evaluate the performance of a classifier by computing the area
under the regions’ boundaries. In particular, the true area for class 1 is the sum of “a”
and “c.” If a classifier produces results like those in Table 1 (e.g., “a” and “b” for
“class 1” and “c” and “d” for “class 0”), then the false negative rate of this classifier
is roughly 15% (i.e., the percentage of “c” out of the whole area of “class 1”) and the
false positive rate is 10% (i.e., 10% of class 0).

                      true class = 1        true class = 0
output class = 1      a ($\lambda_{11}$)    b ($\lambda_{10}$)
output class = 0      c ($\lambda_{01}$)    d ($\lambda_{00}$)

Table 1: Four possible outcomes and their costs for a binary classification are presented.

Where then would the optimal decision boundary be if the costs of misclassification
are unequal? Let us assume that text documents belonging to “class 0” and “class 1”
are confidential information whose careless release may have a damaging effect.
An employee is newly assigned to a project whose records are stored in a secured
repository and are labeled as “class 1.” Since the documents belonging to “class 0”
are about different projects, the employee is supposed to access only documents in
“class 1” in order to understand the background knowledge of the project.

Assuming that no cost is assigned to correct classification ($\lambda_{11} = \lambda_{00} = 0$),
the costs of the two errors should be considered carefully in order to provide reliable
confidential access control. A false negative ($\lambda_{01}$) rejects a valid request (e.g.,
it rejects the employee’s request to access a “class 1” document by predicting the requested
document as “class 0”). A false positive ($\lambda_{10}$) accepts an invalid request (e.g., it
accepts the employee’s request to access a “class 0” document by predicting the
requested document as “class 1”). A false negative inconveniences the employee, who is
unable to access need-to-know information; however, failing to approve valid requests
does not cause a serious problem from the security perspective. A false positive, on the
contrary, is a serious problem because confidential information that should not be revealed
can be accessed. Therefore, for need-to-know confidential authorization, the
cost of a false positive is much higher than that of a false negative. It is thus reasonable
to relocate the uniform-cost decision boundary (i.e., the solid line in Figure 1)
in order to minimize the cost of misclassification. For example, if the cost of a false
positive is higher than that of a false negative, the decision line should be moved toward
the right (i.e., $x_R^*$). The two dashed lines in Figure 1 represent the optimal decision
boundaries for non-uniform misclassification costs assigned to each example.

However, a tradeoff must be considered because choosing one of the extremes (e.g.,
$x_L^*$ or $x_R^*$) sacrifices the error that is not considered. In particular, the classifier
could reduce the false negative rate close to zero if we chose $x_L^*$ as the decision
line, but at the price of a higher false positive rate. If neither extreme is an acceptable
solution, the optimal decision line should be chosen somewhere between the extremes by
considering the tradeoff:


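\[
\begin{aligned}
\mathit{cost}_1 \, P(y = 1 \mid x) &= \mathit{cost}_2 \, P(y = 0 \mid x) \\
\mathit{cost}_1 \, P &= \mathit{cost}_2 \, (1 - P) \\
\mathit{cost}_1 \, P &= \mathit{cost}_2 - \mathit{cost}_2 \, P \\
P &= \frac{\mathit{cost}_2}{\mathit{cost}_1 + \mathit{cost}_2}
\end{aligned}
\]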
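The derivation above yields a probability threshold rather than a position on the $x$ axis. The short sketch below illustrates it under one reading of the symbols, which is an assumption: $\mathit{cost}_1$ is taken as the cost of wrongly rejecting a class 1 example (false negative) and $\mathit{cost}_2$ as the cost of wrongly accepting a class 0 example (false positive). The function names and the example costs are hypothetical.

```python
def cost_threshold(cost1, cost2):
    """P* = cost2 / (cost1 + cost2), obtained from cost1 * P = cost2 * (1 - P)."""
    if cost1 < 0 or cost2 < 0 or cost1 + cost2 == 0:
        raise ValueError("costs must be non-negative and not both zero")
    return cost2 / (cost1 + cost2)

def decide(p_class1, cost1, cost2):
    """Accept (predict class 1) only if P(class 1 | x) reaches the threshold."""
    return 1 if p_class1 >= cost_threshold(cost1, cost2) else 0

equal_cost = cost_threshold(1.0, 1.0)      # uniform costs recover P = 0.5
fp_expensive = cost_threshold(1.0, 9.0)    # false positives 9x as costly
print(equal_cost, fp_expensive)
print(decide(0.8, 1.0, 9.0))               # 0.8 is below the raised threshold
```

With equal costs the threshold reduces to the $P = 0.5$ of the uniform-cost case; raising the false positive cost raises the threshold, which corresponds to moving the decision line to the right in Figure 1.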

2.2  Wrappers  for  Cost-Sensitive  Classification 

In the problem of unequal misclassification cost, the example space is optimally divided
into $|C|$ regions so that class $j$ is the optimal (i.e., least-cost) prediction in region $j$.
The goal of cost-sensitive learning is to find the boundaries between these regions. Obviously
the misclassification cost, particularly a loss matrix (e.g., Table 1), is the dominant factor
for the optimal boundaries. That is, the region where $j$ must be predicted will expand
at the expense of the regions of other classes if misclassifying examples of class $j$ is
more expensive relative to misclassifying others, even though the class probabilities
remain unchanged.

There have been two major approaches to cost-sensitive learning. The first is
a glass-box approach that modifies particular error-minimizing classifiers to make them
cost-sensitive [1], [16]. The second is a black-box approach that converts arbitrary
error-minimizing classifiers into cost-sensitive ones [19], [2].

In this paper, we utilize two methods in the black-box approach for cost-sensitive
learning: costing [19] and MetaCost [2]. A black-box approach makes any
error-minimizing learning method carry out cost-sensitive learning. In
particular, these methods use sampling techniques that change the original example
distribution $D$ to $\hat{D}$ by incorporating into it the relative cost of each instance.
They then make any cost-insensitive error-minimizing classifier perform expected cost
minimization on the newly generated distribution $\hat{D}$. According to a given cost matrix,
this changes the proportion of a certain class by re-sampling the original examples
instead of modifying the learner’s rule.


2.2.1  Costing 

Costing (cost-proportionate rejection sampling with aggregation) is a wrapper for
cost-sensitive learning that trains a set of error-minimizing classifiers on a distribution
obtained by combining the original distribution with the relative cost of each example, and
outputs a final classifier by taking the average over all learned classifiers [19]. It assumes
that changing the original example distribution $D$ to another distribution $\hat{D}$, by
combining it with the cost information, makes any error-minimizing classifier accomplish
expected cost minimization on the original distribution. Costing is comprised of two
processes: rejection sampling and bagging. Rejection sampling has been used to generate
independently and identically distributed (i.i.d.) samples from a proxy distribution
in order to simulate a target distribution $\pi(x)$. To this end, it requires a density
function $g(x)$ and a constant $M > 1$ satisfying the “envelope property”

\[
\pi(x) \le M g(x)
\]

Given a density function $g(x)$ satisfying the “envelope property,” rejection sampling
works as follows: draw $x$ from $g(x)$ and a sample $u$ from a uniform distribution
$U$, $u \in [0, 1]$, and accept $x$ if $u \le \frac{\pi(x)}{M g(x)}$. Otherwise reject the
value of $x$ and repeat the sampling step [11]. The accepted values are regarded as a
realization of $\pi(x)$. In particular, suppose we run the sampling until $N$ values are
accepted; we can then estimate $\mu$ by using the $N$ accepted samples because those samples
are i.i.d. samples from $\pi(x)$.

Rejection sampling for costing assigns each example in the original distribution
a relative cost (footnote 2) and draws a random number $r \in [0, 1]$ from a uniform
distribution $U$. Whether the example is kept is decided by comparing $r$ with its relative
cost $c/Z$, so that examples with higher cost are more likely to be kept; a rejected example
is discarded, and sampling continues until a certain criterion is satisfied. The accepted
examples are regarded as a realization of the altered distribution
$\hat{D} = \{S_1, S_2, \ldots, S_k\}$. With the altered distribution $\hat{D}$, costing trains
$k$ different hypotheses, $h_i = \text{Learn}(S_i)$, and predicts the label of a test example
$x$ by combining those hypotheses, $h(x) = \text{sign}\left(\sum_{i=1}^{k} h_i(x)\right)$.

Footnote 2: $\hat{x}_i = \frac{c}{Z} \times x_i$, where $c$ is the cost assigned to $x_i$ and
$Z = \max_{c \in S} c$.
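The two stages of costing can be sketched as follows. This is an illustration only, not the implementation used in the experiments: the base learner (a one-dimensional threshold stump), the toy data, and all function names are hypothetical, and each example is accepted with probability $c/Z$ as in cost-proportionate rejection sampling [19].

```python
import random

def rejection_sample(examples, rng):
    """Keep each (x, y, cost) with probability cost / Z, where Z is the largest cost."""
    z = max(c for _, _, c in examples)
    return [(x, y) for x, y, c in examples if rng.random() < c / z]

def learn_stump(sample):
    """Hypothetical base learner: threshold t with the fewest errors; predicts +1 if x > t."""
    best_t, best_err = float("-inf"), len(sample) + 1
    for t in [float("-inf")] + sorted({x for x, _ in sample}):
        err = sum(1 for x, y in sample if (1 if x > t else -1) != y)
        if err < best_err:
            best_t, best_err = t, err
    return lambda x, t=best_t: 1 if x > t else -1

def costing(examples, k, seed=0):
    """Train k hypotheses on k rejection samples and combine them by the sign of their sum."""
    rng = random.Random(seed)
    hypotheses = []
    for _ in range(k):
        sample = rejection_sample(examples, rng)
        if sample:                      # skip the (rare) empty sample
            hypotheses.append(learn_stump(sample))
    # Ties are resolved as -1 (reject), the conservative choice for access control.
    return lambda x: 1 if sum(h_i(x) for h_i in hypotheses) > 0 else -1

# Toy data: (feature, label, cost); wrongly accepting a -1 example costs 5x more.
examples = [(0.1, -1, 5), (0.3, -1, 5), (0.5, -1, 5),
            (0.45, 1, 1), (0.7, 1, 1), (0.9, 1, 1)]
h = costing(examples, k=10)
print(h(0.05), h(0.95))
```

Because the high-cost examples are always retained while low-cost ones are kept only occasionally, each base learner sees a sample in which costly mistakes are over-represented, which is what lets an ordinary error-minimizing learner act cost-sensitively.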
2.2.2  Meta-costing

MetaCost is another method for converting an error-minimizing classifier into a
cost-sensitive classifier by re-sampling [2]. The underlying assumption is that an ordinary
error-minimizing classifier can learn the optimal decision boundary based on
the cost matrix if each training example is relabeled with its cost-optimal class. The
learning process of MetaCost is comprised of two steps: bagging and retraining the
classifier with cost. In particular, it generates a set of samples with replacement from the
training set and estimates the class of each instance by taking the average of votes over all
the trained classifiers. MetaCost then relabels each training example with the
estimated optimal class and retrains the classifier on the relabeled training set.

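\[
R(i \mid x) = \sum_{j} P(j \mid x) \, C(i, j), \qquad
\text{relabel } x \text{ with } \arg\min_{i} R(i \mid x)
\]

where $R(i \mid x)$ is the expected cost of predicting that $x$ belongs to the $i$th class,
$C(i, j)$ is the cost of predicting class $i$ when the true class is $j$, and
$P(j \mid x)$ is the estimated probability that $x$ belongs to class $j$.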
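The relabeling step can be illustrated with a small sketch. The cost matrix values below are hypothetical, and the class probabilities are assumed to be already available (in MetaCost they would come from the votes of the bagged classifiers).

```python
def expected_costs(probs, cost):
    """R(i|x) = sum_j P(j|x) * C(i, j) for every candidate prediction i."""
    return [sum(p_j * cost[i][j] for j, p_j in enumerate(probs))
            for i in range(len(cost))]

def relabel(probs, cost):
    """Return the class index with the smallest expected cost."""
    risks = expected_costs(probs, cost)
    return min(range(len(risks)), key=risks.__getitem__)

# cost[i][j]: cost of predicting class i when the true class is j (illustrative values).
cost = [[0.0, 1.0],   # predict 0: correct for class 0, false negative for class 1
        [9.0, 0.0]]   # predict 1: false positive for class 0, correct for class 1

print(expected_costs([0.3, 0.7], cost))
print(relabel([0.3, 0.7], cost), relabel([0.05, 0.95], cost))
```

In the first call the example is more likely to be class 1, yet it is relabeled as class 0 because a false positive is nine times as costly as a false negative; only a sufficiently confident example keeps the class 1 label.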

3  Binary  Classification  For  Supporting  Critical  Decision

Our goal is to develop a reliable process for confidential access control based on the
need-to-know principle: the request for access to a unit of confidential information is
accepted only if the content of the requested item is relevant to the requester’s task. It
is reasonable to verify whether or not the content of a requested confidential item is
associated with the content of a requester’s project because the requester only needs
to know information related to his/her project in order to conduct the given task. In
other words, a request for confidential information that the requester does not “need to
know” is always rejected. To this end, we model this dichotomous decision (i.e., to
reject or accept the request) in a machine learning framework. We choose four different
classification methods (linear discriminant analysis, logistic regression, support vector
machines, and the naive Bayes classifier) because of their relatively good performance,
particularly in text classification [5], [12], [14], [9].

3.1  Linear  Discriminant  Analysis 

Linear Discriminant Analysis (LDA) is a well-known method in statistical pattern
recognition that projects the observed patterns into a low-dimensional space in which
the classes are well separated [6]. In particular, LDA produces an optimal linear
discriminant function $f(x) = W^T x$ which maps the input example into the classification
space, in which the class identification of this sample is decided based on some metric
such as Euclidean distance or Mahalanobis distance. A typical LDA implementation
is carried out via scatter matrix analysis. We compute the within-class and between-class
scatter matrices as follows:

\[
S_w = \sum_{i=1}^{M} p(C_i) \, \Sigma_i \qquad (1)
\]
where $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix of the $i$th class,
$S_w$ is the within-class scatter matrix showing the average scatter $\Sigma_i$ of the sample
vectors $x$ of the different classes $C_i$ around their respective means $\mu_i$, and $S_b$ is
the between-class scatter matrix, representing the scatter of the conditional mean vectors
$\mu_i$ around the overall mean vector $\mu$. Various measures are available for quantifying
the discriminatory power, the commonly used one being [6]:


\[
J(W) = \frac{\lVert W^T S_b W \rVert}{\lVert W^T S_w W \rVert}
\]

where $W$ is the optimal discrimination projection and can be obtained by solving the
generalized eigenvalue problem:

\[
S_b W = \lambda S_w W
\]

The distance measure used in the matching could be a simple Euclidean or Mahalanobis
distance. However, for our case (a binary classification of whether a document belongs
to the need-to-know confidential class or not), Euclidean distance is used because the
maximum rank of $S_b$ is $|C| - 1$, where $|C|$ is the number of classes, meaning that LDA
cannot produce more than $|C| - 1$ features. LDA has been used in [12] as a text
classification method and in [14] as a feature selection method.
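For two classes the generalized eigenvalue problem has a closed-form solution, $W \propto S_w^{-1}(\mu_1 - \mu_0)$, so LDA yields a single feature. The sketch below illustrates this for two-dimensional toy data, with classification by Euclidean distance to the projected class means. It assumes $S_w$ is invertible; the data and function names are hypothetical.

```python
def mean2(vectors):
    """Mean of a list of 2-D points."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(2)]

def covariance2(vectors, mu):
    """2x2 covariance matrix: average outer product of deviations from mu."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in vectors:
        d = [v[0] - mu[0], v[1] - mu[1]]
        for r in range(2):
            for c in range(2):
                s[r][c] += d[r] * d[c] / len(vectors)
    return s

def fisher_direction(class0, class1):
    """Return (w, mu0, mu1) with w = S_w^{-1} (mu1 - mu0)."""
    n0, n1 = len(class0), len(class1)
    p0, p1 = n0 / (n0 + n1), n1 / (n0 + n1)          # class priors p(C_i)
    mu0, mu1 = mean2(class0), mean2(class1)
    s0, s1 = covariance2(class0, mu0), covariance2(class1, mu1)
    sw = [[p0 * s0[r][c] + p1 * s1[r][c] for c in range(2)] for r in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]  # must be non-zero
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    return w, mu0, mu1

def classify(x, w, mu0, mu1):
    """Assign x to the class whose projected mean is nearest (1-D Euclidean distance)."""
    proj = lambda v: w[0] * v[0] + w[1] * v[1]
    return 1 if abs(proj(x) - proj(mu1)) < abs(proj(x) - proj(mu0)) else 0

class0 = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class1 = [(6.0, 5.0), (7.0, 6.0), (7.0, 4.0), (8.0, 5.0)]
w, mu0, mu1 = fisher_direction(class0, class1)
print(w, classify((6.0, 4.0), w, mu0, mu1), classify((2.0, 2.5), w, mu0, mu1))
```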


3.2  Logistic  Regression 

Logistic regression is a statistical technique for modeling a binary response variable by
a linear combination of one or more features, using a logit link function

\[
P(\text{class} = 1 \mid w, x) = \psi(w^T x),
\]

where

\[
\psi(w^T x) = \frac{\exp(w^T x)}{1 + \exp(w^T x)} = \frac{1}{1 + \exp(-w^T x)}.
\]


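The Bayesian approach to logistic regression assumes Gaussian priors,
$p(\text{class} = 0 \mid x) \sim N(\mu_0, \sigma_0^2)$ and
$p(\text{class} = 1 \mid x) \sim N(\mu_1, \sigma_1^2)$. The maximum a posteriori (MAP)
estimate of $w$ is found as follows:

\[
\begin{aligned}
L(w) &= \arg\max_{w} \left\{ P(w \mid D) \right\} \\
     &= \arg\max_{w} \left\{ P(w) \prod_{i=1}^{n} \frac{1}{1 + \exp(-w^T x_i y_i)} \right\} \\
     &= \arg\max_{w} \left\{ \ln P(w) - \sum_{i=1}^{n} \ln\left(1 + \exp(-w^T x_i y_i)\right) \right\}
\end{aligned}
\]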

where  P(w)  is  the  prior  on  w  and  n  is  the  number  of  the  training  examples. 
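The MAP objective can be maximized by plain gradient ascent, as in the sketch below. Several choices here are assumptions made for illustration rather than details taken from the text: the prior is a zero-mean Gaussian on $w$ with variance `prior_var`, labels are coded as $y_i \in \{-1, +1\}$, a constant first feature serves as the bias, and the step size and iteration count are arbitrary.

```python
import math

def map_logistic(xs, ys, prior_var=10.0, lr=0.1, steps=2000):
    """Gradient ascent on ln P(w) - sum_i ln(1 + exp(-y_i * w.x_i))."""
    dim = len(xs[0])
    w = [0.0] * dim
    for _ in range(steps):
        # Gradient of the log prior of N(0, prior_var * I).
        grad = [-w_j / prior_var for w_j in w]
        for x, y in zip(xs, ys):
            margin = y * sum(w_j * x_j for w_j, x_j in zip(w, x))
            coeff = y / (1.0 + math.exp(margin))   # derivative of -ln(1 + exp(-margin))
            for j in range(dim):
                grad[j] += coeff * x[j]
        w = [w_j + lr * g_j for w_j, g_j in zip(w, grad)]
    return w

def prob_class1(w, x):
    """psi(w.x) = 1 / (1 + exp(-w.x))."""
    return 1.0 / (1.0 + math.exp(-sum(w_j * x_j for w_j, x_j in zip(w, x))))

# Toy 1-D data with a leading constant feature acting as the bias term.
xs = [(1.0, 0.2), (1.0, 0.3), (1.0, 0.4), (1.0, 0.6), (1.0, 0.7), (1.0, 0.8)]
ys = [-1, -1, -1, 1, 1, 1]
w_map = map_logistic(xs, ys)
print(prob_class1(w_map, (1.0, 0.8)), prob_class1(w_map, (1.0, 0.2)))
```

The prior term keeps the weights finite even though this toy set is linearly separable, where an unregularized maximum-likelihood fit would diverge.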
3.3  Support  Vector  Machines

Support Vector Machines (SVMs) learn the parameters $w$ and $b$ specifying a linear
decision rule $h(x) = \text{sign}(w \cdot x + b)$, so that the smallest distance between each
training example and the decision boundary (i.e., the margin) is maximized [15]. An SVM works
by solving the following optimization problem:

\[
\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2} w^T w + C \sum_{i=1}^{n} \xi_i \\
\text{subject to} \quad & \forall i, \; y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0
\end{aligned}
\]

Here each training vector $x_i$ is mapped into a higher-dimensional space by the function
$\phi$. The SVM then finds a linear separating hyperplane with the maximal margin in this
higher-dimensional space. $C > 0$ is the penalty parameter of the error term. The constraints
require that all examples in the training set are classified correctly up to some slack
variable $\xi_i$. If a training example lies on the wrong side of the decision boundary, the
corresponding $\xi_i$ is greater than 1. Therefore, $\sum_{i=1}^{n} \xi_i$ is an upper bound on
the number of training errors. The factor $C$ is a parameter that allows one to trade off
training error and model complexity.
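The role of the slack variables can be made concrete with a small sketch that evaluates the objective for a fixed, hand-picked hyperplane; it does not solve the optimization problem. It assumes a linear kernel ($\phi$ is the identity), and the data, weights, and function names are illustrative.

```python
def slacks(w, b, xs, ys):
    """xi_i = max(0, 1 - y_i * (w.x_i + b)): the smallest slack satisfying each constraint."""
    return [max(0.0, 1.0 - y * (sum(w_j * x_j for w_j, x_j in zip(w, x)) + b))
            for x, y in zip(xs, ys)]

def objective(w, b, xs, ys, C):
    """0.5 * w.w + C * sum_i xi_i for the given hyperplane."""
    return 0.5 * sum(w_j * w_j for w_j in w) + C * sum(slacks(w, b, xs, ys))

def training_errors(w, b, xs, ys):
    """Number of examples on the wrong side of the decision boundary."""
    return sum(1 for x, y in zip(xs, ys)
               if y * (sum(w_j * x_j for w_j, x_j in zip(w, x)) + b) < 0)

w, b = [2.0], -1.0                      # decision boundary at x = 0.5
xs = [[0.0], [0.2], [0.6], [0.9], [1.0]]
ys = [-1, -1, -1, 1, 1]                 # the example at 0.6 is misclassified

xi = slacks(w, b, xs, ys)
print(xi, training_errors(w, b, xs, ys), objective(w, b, xs, ys, C=1.0))
```

Only the misclassified example at 0.6 has a slack above 1, while examples that are correctly classified but inside the margin contribute slacks between 0 and 1, so the slack sum bounds the error count from above.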

3.4  Naive  Bayes  Classifier 

A Bayesian learning framework assumes that the examples were generated by a parametric
model and uses training data to compute Bayes-optimal estimates of the model
parameters. With these estimates, it classifies new test examples using Bayes’ rule to
calculate the posterior probability that a class would have generated the test example
in question. Classification is then carried out by selecting the most probable class. In
addition to this framework, naive Bayes classification assumes that all attributes of
the examples are independent of each other given the context of the class.

The naive Bayes classifier we used is a multinomial model that represents an example
as the set of attribute occurrences from the training set. The individual attribute
occurrences are taken to be the “events” and the example to be the collection
of attribute events. This model is known to perform better than a multi-variate Bernoulli
model, in which an example is regarded as an “event” consisting of the absence or presence
of attributes, because it captures the frequency of the attributes [9].

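The naive Bayes classifier predicts the class $c_j$ that maximizes the posterior probability,
$P(c_j \mid x_i)$, for an example vector $x_i$, under the attribute independence assumption.

\[
\begin{aligned}
\arg\max_{j} P(c_j \mid x_i)
  &= \arg\max_{j} \left\{ P(c_j) P(x_i \mid c_j) / P(x_i) \right\} \\
  &= \arg\max_{j} \left\{ P(c_j) P(x_i \mid c_j) \right\} \\
  &= \arg\max_{j} \left\{ P(c_j) \prod_{k=1}^{|A|} P(a_k \mid c_j)^{n_{i,k}} \right\} \\
  &= \arg\max_{j} \left\{ \log P(c_j) + \sum_{k=1}^{|A|} \log P(a_k \mid c_j)^{n_{i,k}} \right\} \\
  &= \arg\max_{j} \left\{ \log P(c_j) + \sum_{k=1}^{|A|} n_{i,k} \log P(a_k \mid c_j) \right\}
\end{aligned}
\]

where $|A|$ is the total number of attributes, $n_{i,k}$ is the frequency of the $k$th
attribute in the $i$th example, $P(c_j)$ is the class prior estimated from the training set,
and

\[
P(a_k \mid c_j) = \frac{1 + n_{j,k}}{|A| + \sum_{k \in |A|} n_{j,k}}.
\]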

Although the feature independence assumption is unrealistic in real-world
problems, naive Bayes classification has been shown to be surprisingly effective
and is computationally efficient [9]. In particular, training such a classifier
requires time that is only linear in the number of features and data instances, because
it does not use word combinations as predictors; it is thus far more efficient than
exponential non-Bayes approaches.
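A compact sketch of the multinomial model with the smoothed estimate of $P(a_k \mid c_j)$ described above is given below. Two points are assumptions for illustration: the class prior is taken as the relative class frequency in the training set, and documents are given directly as attribute-count vectors. The toy data and function names are hypothetical.

```python
import math

def train_nb(docs, labels, vocab_size):
    """Estimate class priors and smoothed attribute probabilities from count vectors."""
    classes = sorted(set(labels))
    # Assumed prior: relative class frequency in the training set.
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: [0] * vocab_size for c in classes}
    for doc, c in zip(docs, labels):
        for k, n in enumerate(doc):
            counts[c][k] += n
    cond = {}
    for c in classes:
        total = sum(counts[c])
        # P(a_k | c_j) = (1 + n_jk) / (|A| + sum_k n_jk)
        cond[c] = [(1 + counts[c][k]) / (vocab_size + total) for k in range(vocab_size)]
    return prior, cond

def predict_nb(doc, prior, cond):
    """argmax_j { log P(c_j) + sum_k n_ik * log P(a_k | c_j) }."""
    def score(c):
        return math.log(prior[c]) + sum(n * math.log(cond[c][k]) for k, n in enumerate(doc))
    return max(prior, key=score)

# Toy count vectors over a 3-attribute vocabulary.
docs = [[3, 0, 1], [2, 1, 0], [0, 2, 3], [0, 3, 2]]
labels = [1, 1, 0, 0]
prior, cond = train_nb(docs, labels, vocab_size=3)
print(predict_nb([2, 0, 0], prior, cond), predict_nb([0, 1, 2], prior, cond))
```

Training is a single pass over the counts, which reflects the linear training time noted above.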


4  Experiments 

As we described earlier, the scenario in which we are particularly interested is a process
of confidential access control based on the need-to-know principle. The purpose of the
experiments is two-fold: to find a good classification method that minimizes the cost
and the false positive rate while holding the false negative rate reasonably low, and to verify
that the wrappers for cost-sensitive learning reduce the total cost in comparison
with error-minimizing classifiers. Given these objectives, three performance metrics
are primarily used to measure the usefulness of classifiers: the false negative rate, defined as
$fn = \frac{c}{a + c}$ using the values in Table 1; the false positive rate,
$fp = \frac{b}{b + d}$; and the cost of misclassification. These metrics are better matched
to our purpose than conventional measures based on precision and recall because we are
primarily interested in reducing the error and the cost. Moreover, the two error measures are
not sensitive to changes in class frequency, whereas precision and recall are sensitive to
the frequencies of the target classes [4].
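The three metrics can be computed from the four outcome counts of Table 1, as sketched below. The counts mirror the illustrative 15% and 10% rates from Section 2.1. The cost function assumes a single uniform cost per error type, which is a simplification of per-document costs; the cost values and function names are hypothetical.

```python
def error_rates(a, b, c, d):
    """Return (fn, fp) with fn = c / (a + c) and fp = b / (b + d), as in Table 1."""
    fn_rate = c / (a + c)   # share of true class 1 examples that were rejected
    fp_rate = b / (b + d)   # share of true class 0 examples that were accepted
    return fn_rate, fp_rate

def total_cost(b, c, cost_fp, cost_fn):
    """Total misclassification cost, assuming one uniform cost per error type."""
    return b * cost_fp + c * cost_fn

# a: true positives, b: false positives, c: false negatives, d: true negatives.
a, b, c, d = 85, 10, 15, 90
fn_rate, fp_rate = error_rates(a, b, c, d)
print(fn_rate, fp_rate, total_cost(b, c, cost_fp=9.0, cost_fn=1.0))
```

Each rate is normalized by the size of its own true class, which is why these measures do not change when the class frequencies change.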

Since there are no datasets available that are comprised of confidential information,
we choose the Reuters-21578 document collection for experiments. This data set,
which consists of world news stories from 1987, has become a benchmark in text
categorization evaluations. It has been partially labelled by experts with respect to a list of
categories. These categories have been grouped into super-categories of people, topics,
places, organisations, etc. The category distribution is skewed: the most common
category has a training-set frequency of 2,877, but 82% of the categories have fewer than 100
instances and 33% of the categories have fewer than 10 instances. There are 135 overlapping
topic categories. Since ours is a binary classification task where each document has
an exclusive category (i.e., either positive or negative), we discarded documents that
are assigned no topic or multiple topics. Moreover, classes with frequencies less than
10 are discarded. The resulting data set is comprised of 9,854 documents as a training
set and 4,274 documents as a test set, with 67 classes (topics).

Each document is represented by a multi-dimensional vector whose size is determined
by the size of the vocabulary and whose elements each correspond to a word in the
vocabulary. The vocabulary was constructed by discarding stop words, overly frequent
words, and rarely occurring words (footnote 3), without stemming. After these steps, the size
of the original vocabulary (i.e., the set of unique unigrams) was reduced from 23,918 to
7,114. In order to reduce the dimensionality further, we tested three different feature
selection methods: $\chi^2$ statistics, information gain, and point-wise mutual
information. We found that the $\chi^2$ method performed best, which replicates results
reported in other research [18], [17]. The dimension of the feature space is finally set to
1,000. Previous work on the Reuters-21578 dataset showed that such a drastic reduction
of the feature space’s dimension does not degrade performance [5], [9]. Each
document is then represented using those selected 1,000 words, whose weights
are computed by:


\[
w_{i,k} = \frac{tf_{i,k}}{tf_{i,k} + 0.5 + 1.5 \times \frac{dl_i}{\mathit{avedl}}}
          \times \left( \log_2 \frac{N}{df_k} + 1.0 \right)
\]


where $tf_{i,k}$ is the frequency of the $k$th word in the $i$th document, $dl_i$ is the length of the $i$th document, $ave\_dl$ is the average document length in the training documents, $N$ is the total number of training documents, and $df_k$ is the document frequency of the $k$th word. The final size of the word-by-document matrix is 1000 × 9854, which is considerably smaller than the original 7114 × 9854 matrix. For a succinct representation of this matrix, we tested two techniques for dimension reduction: principal component analysis (PCA) and LDA/PCA. Since there is a noticeable performance difference between the two techniques in representing 90% of the total variance⁴ in the covariance matrix, we used PCA alone for a concise representation of the word-by-document matrix.
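The $\chi^2$ feature selection step mentioned above can be sketched as follows. This is a minimal illustration that assumes the commonly used 2×2 contingency-table form of the statistic; the function names `chi_square` and `select_top_terms` and the toy counts are hypothetical and are not taken from the experiments.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic of a (term, category) pair.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that do not contain the term
    d: documents outside the category that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term carries no information
    return n * (a * d - c * b) ** 2 / denom


def select_top_terms(tables: dict, k: int) -> list:
    """Return the k terms with the highest chi-square score."""
    scores = {term: chi_square(*counts) for term, counts in tables.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy counts (hypothetical): "wheat" separates the category perfectly,
# while "the" is spread evenly over both classes.
toy_tables = {"wheat": (10, 0, 0, 10), "the": (5, 5, 5, 5)}
top_terms = select_top_terms(toy_tables, k=1)
```

In the setting described above, `k` would be 1,000 and the tables would be built from the 7,114-word vocabulary.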

The experimental setting is as follows. All the documents are regarded as confidential and accordingly they are kept in a secured container, ensuring that only authorized users are allowed access. Documents belonging to the selected category are regarded as confidential information that the requester needs to know. Conversely, the rest of the test documents are confidential information that should not be revealed. A false positive occurs when the system accepts a request that should not have been accepted, whereas a false negative occurs when the system rejects a request that should have been accepted. From the security perspective, it is more tolerable to have an authorization process with a high false negative rate than one with a high false positive rate.


4.1  Cost  Assignment 

According to the class assignment (not the original category label, but the artificially assigned class label, i.e., need-to-know confidential or otherwise; simply, positive or negative), each of the documents in both the training and testing sets is assigned a cost, ensuring that the misclassification cost of need-to-know confidential information is higher than that of the remaining confidential information (i.e., $\lambda_{10} > \lambda_{01}$, $\lambda_{10} > \lambda_{00}$, $\lambda_{01} > \lambda_{11}$) [3]. Otherwise, an unreasonable assignment of cost leads a classifier to always predict

³ This is done by removing words whose document frequency is less than the “rarely occurring” threshold (e.g., 3) or greater than the “too frequent” threshold (e.g., 500).

4D  =  e^D,  where,  D  =  k  x  n,  D  =  m  x  n,k  <C  m,  k  =  k  1 —  >  90%  of  variance  in 

S,  £A  =  eA. 
the dominant class, regardless of what the true class is. In particular, let us assume that the cost of the first row in Table 1 is greater than that of the second one (i.e., $\lambda_{1j} > \lambda_{0j}$). A classifier then always predicts “class=0” regardless of its learned rules (e.g., posterior probability distribution).

Since the Reuters-21578 document collection does not have cost information, we devised a heuristic for cost assignment. It follows two ideas: first, there is a cost involved in incorrect classification; second, a higher cost is assigned to a false positive than to a false negative. In particular, the cost for misclassifying a document, $d_i$, is computed by:


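$$cost(d_i) = \begin{cases} [s,\ s + |c_j|] & \text{if } d_i \in c_j \text{ and } c_j = \text{positive} \\ \dfrac{\sum_{d_s \in c_j} cost(d_s)}{|\text{number of negative documents}|} & \text{otherwise} \end{cases}$$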


where $s = \ln\left(\frac{N}{|c_j|}\right) \times 100$, $N$ is the total number of documents, and $|c_j|$ is the number of documents belonging to the $j$th category. This heuristic is intended to prevent a low cost from being achieved simply by ignoring the minority class. For example, suppose that 10 out of 10,000 documents belong to the positive class. The cost assignment ensures that the total cost for misclassifying those 10 examples is equal to or higher than that for misclassifying the remaining documents.⁵ ⁶ Without further care, a low error rate can still be achieved simply by ignoring the confidential class: a dumb classifier achieves 99.9% accuracy by predicting all documents as negative, and it pays only half of the total cost for its incorrect classification. Obviously this should be avoided. To this end, the cost of the misclassified confidential documents is added once more to the total cost for misclassification if a classifier is not able to predict any of the positive cases. The dumb classifier above then pays 13815.6, because it classifies all positive examples incorrectly even though it classifies all negative examples correctly. The total cost is computed by summing the misclassification costs of positive and negative examples, and the additional cost for the positive cases is imposed on a classifier that is only able to classify the non-confidential documents correctly. It is reasonable that a classifier that predicts the labels of all positive documents incorrectly and those of all negative documents correctly eventually pays the same cost as a classifier that classifies all documents incorrectly.
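The arithmetic of the 10-out-of-10,000 example can be checked with a short sketch. It assumes that every positive document is charged exactly $s$ (the lower end of the positive-class cost range) and that $N$ is the total number of documents; the variable names are illustrative only.

```python
import math

N_TOTAL = 10_000      # all documents in the example
N_POSITIVE = 10       # documents in the positive (need-to-know) class
N_NEGATIVE = N_TOTAL - N_POSITIVE

# s = ln(N / |c_j|) * 100
s = math.log(N_TOTAL / N_POSITIVE) * 100

# Assumption: each positive document costs exactly s.
positive_cost = s
positive_total = positive_cost * N_POSITIVE

# The summed positive cost is spread evenly over the negative documents.
negative_cost = positive_total / N_NEGATIVE
negative_total = negative_cost * N_NEGATIVE

# A classifier that predicts everything as negative misses all positives
# and is charged the positive cost once more as a penalty.
dumb_cost = positive_total + positive_total

# A classifier that gets every document wrong pays both class totals.
all_wrong_cost = positive_total + negative_total
```

Under these assumptions `s` is about 690.78, `negative_cost` is about 0.6915, and both `dumb_cost` and `all_wrong_cost` come to about 13815.5, which agrees with the 13815.6 quoted in the text once $s$ is rounded to 690.78.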

This assignment method works both when a cost is assigned to each example and when a cost is assigned to each case (e.g., false alarm and miss). For the per-example cost-sensitive


⁵ For this case, the cost for misclassifying a positive document is 690.67 ($\ln\left(\frac{9990}{10}\right) \times 100 = 690.6755$) and the sum of the cost is 6906.755 (690.6755 × 10). Accordingly, the cost of misclassifying a negative document is 0.6913 (6906.755 / 9990 = 0.6913) and the cost sums to 6906.087.

⁶ For this case, the cost for misclassifying a positive document is 690.78 and the sum of the cost is 6907.8 (690.78 × 10). Accordingly, the cost of misclassifying a negative document is 0.6914 and the cost sums to 6907.8.
Figure 2: A pair of false positive (filled bar) and false negative (empty bar) for the “livestock” category is presented.


learning, the misclassification cost should be paid when a positive is classified as a negative. For the previous example, the cost for a false alarm (i.e., $\lambda_{10}$ in Table 1) is 690.67 whereas the cost for a miss (i.e., $\lambda_{01}$) is 0.6913.

4.2  Experimental  Results 

We chose five categories as representatives according to their category frequencies: small (livestock and corn), medium (interest), and large (acq and earn). For each category, 70% of the documents are used for “training” and the remaining 30% for “testing.” Nine classifiers are tested: LDA, LR, and SVMs, and the combination of those three classifiers with two wrappers for cost-sensitive learning, metacost (MC) and costing.⁷ A binary classifier was trained for each of the selected categories by considering the category as positive and the rest of the data as negative examples. We made use of LIBSVM⁸ and tested three kernels: linear, polynomial, and Gaussian. The Gaussian

kernel  (width  =  - — — ^ - —  )  was  chosen  due  to  its  best  performance  and  the 

different cost factors are assigned,⁹ $C = 10 \sim 100$. Those values were chosen by 10-fold cross validation.

The experimental results are primarily analyzed by “false positive,” “false negative,” and “cost.” The procedure of the experiments is as follows: first, pick one of the five selected categories; second, assign a cost to each of the examples according to its importance using the heuristic described in Section 4.1; then, train each of the nine classifiers on the training examples with their costs; finally, compute the three performance measures by using the contingency table.
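The final step of the procedure, computing the three measures from the contingency table, might look like the sketch below. It assumes that the false positive rate is taken relative to the number of negative test documents and the false negative rate relative to the number of positive ones, and it leaves out the extra penalty for classifiers that miss every positive; the function name `evaluate` is hypothetical.

```python
def evaluate(y_true, y_pred, costs):
    """Return (false positive rate, false negative rate, total cost).

    y_true, y_pred: sequences of labels, 1 = positive, 0 = negative.
    costs: per-example misclassification cost, aligned with y_true.
    """
    pairs = list(zip(y_true, y_pred))
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # accepted, should be rejected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # rejected, should be accepted
    n_neg = sum(1 for t in y_true if t == 0)
    n_pos = len(y_true) - n_neg
    fp_rate = fp / n_neg if n_neg else 0.0
    fn_rate = fn / n_pos if n_pos else 0.0
    # Only misclassified examples contribute to the total cost.
    total_cost = sum(c for (t, p), c in zip(pairs, costs) if t != p)
    return fp_rate, fn_rate, total_cost


# Toy run (hypothetical labels and costs).
fp_rate, fn_rate, total_cost = evaluate(
    y_true=[1, 1, 0, 0, 0],
    y_pred=[1, 0, 1, 0, 0],
    costs=[5.0, 5.0, 1.0, 1.0, 1.0],
)
```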

Figures 2 to 6 show pairs of false positive and false negative for each of the

⁷ The results of naive Bayes classifiers were removed due to their poor performance.

⁸ http://www.csie.ntu.edu.tw/~cjlin/libsvm/

⁹ The cost of constraint violation is set to 100 if there is a relatively small number of positive examples available. Otherwise it is set to about 10.
| Methods | livestock (114) | corn (253) | interest (513) | acq (2448) | earn (3987) |
|---|---|---|---|---|---|
| SVM | 13967 | 66453 | 54065 | 83141 | 108108 |
| SVM (w/ costing) | 4035 ± 30 | 8851 ± 52 | 9058 ± 159 | 40009 ± 252 | 96007 ± 331 |
| SVM (w/ MC) | 7147 ± 50 | 23596 ± 64 | 32011 ± 321 | 194165 ± 451 | 228612 ± 453 |
| LR | 35809 | 32759 | 60031 | 349080 | 710631 |
| LR (w/ costing) | **484 ± 11** | **1333 ± 44** | 29614 ± 110 | **606 ± 145** | **2521 ± 191** |
| LR (w/ MC) | 34980 ± 35 | 32759 ± 79 | 60374 ± 154 | 386859 ± 1185 | 788819 ± 263 |
| LDA | 2638 | 66453 | 124733 | 591300 | 908690 |
| LDA (w/ costing) | 1461 ± 28 | 6092 ± 89 | **7301 ± 152** | 39354 ± 205 | 41478 ± 159 |
| LDA (w/ MC) | 40079 ± 57 | 45778 ± 71 | 8955 ± 157 | 51789 ± 285 | 54084 ± 244 |
| Cost for base line | 42625 | 79084 | 139113 | 591357 | 1090498 |

Table 2: The costs incurred by the nine classifiers are presented. The values in bold face are the best for the corresponding category.


selected categories by the nine classifiers, which are numbered from left to right: SVM (1), SVM with costing (2), SVM with metacost (3), LR (4), LR with costing (5), LR with metacost (6), LDA (7), LDA with costing (8), and LDA with metacost (9).

Except for the “interest” category, LR with costing showed the best results, minimizing false positives while holding false negatives low. In particular, for the “livestock” category, LR trained on only 18% of the training data (i.e., 1781 out of 9854 documents) resulted in a 0% false positive rate and a 2.8% false negative rate. For costing, we tried five different numbers of sampling trials for each category (i.e., 1, 3, 5, 10, and 15) and report the one with the best performance. For this category, a newly generated distribution from 10 rejection sampling trials is used to achieve this result. Each resampled set has only about 178 documents. LDA with costing showed the smallest error for the “interest” category, comprising a 5.6% false positive rate and a 4.5% false negative rate.
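The rejection sampling trials used by costing can be sketched as below. This is a simplified illustration of cost-proportionate rejection sampling followed by averaging. It assumes that each example is accepted with probability equal to its cost divided by the maximum cost, and the `train` argument is a hypothetical stand-in for fitting LR, LDA, or an SVM.

```python
import random


def rejection_sample(examples, costs, rng):
    """Keep each example with probability cost / max(costs)."""
    z = max(costs)  # assumption: normalize by the largest cost
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / z]


def costing(examples, costs, train, n_trials, seed=0):
    """Train one model per resampled set and average their outputs.

    train: function mapping a list of examples to a model, where a model
           is a callable returning a score in [0, 1] for an input.
    """
    rng = random.Random(seed)
    models = [train(rejection_sample(examples, costs, rng))
              for _ in range(n_trials)]
    return lambda x: sum(m(x) for m in models) / len(models)


# Toy data mirroring the 10-out-of-10,000 example of Section 4.1.
toy_examples = ["pos"] * 10 + ["neg"] * 9990
toy_costs = [690.78] * 10 + [0.6914] * 9990
toy_sample = rejection_sample(toy_examples, toy_costs, random.Random(1))
```

Because the positive examples carry the maximum cost, they are always kept, while each negative example survives with a probability of roughly 0.001. This is why a resampled set is small and close to balanced.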

Table 2 replicates this trend in terms of the total cost for misclassification. The number in parentheses next to each topic name is the total number of text documents belonging to that category. The results reported for costing and metacost are the average of 5 different runs. The bottom line, entitled “cost for base line,” is the cost for a category if a classifier classifies all the testing examples incorrectly (e.g., a classifier for the “livestock” category incurs a misclassification cost of 42625 if it classifies all of them incorrectly). For the “earn” category, LR with costing incurred only 0.002 of the total cost (2521 out of 1090498). For the remaining categories, the best performer paid no more than about 0.05 of the total cost.
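The fractions of the baseline cost can be recomputed directly from Table 2. The short sketch below uses the mean cost of the best classifier per category (LR with costing, except LDA with costing for “interest”); the dictionary names are illustrative.

```python
# Baseline cost per category: every test example misclassified (Table 2, last row).
BASELINE = {
    "livestock": 42625,
    "corn": 79084,
    "interest": 139113,
    "acq": 591357,
    "earn": 1090498,
}

# Mean cost of the best-performing classifier per category (Table 2).
BEST = {
    "livestock": 484,
    "corn": 1333,
    "interest": 7301,
    "acq": 606,
    "earn": 2521,
}

# Fraction of the baseline cost paid by the best performer.
ratios = {name: BEST[name] / BASELINE[name] for name in BASELINE}
```

The resulting fractions are roughly 0.011 (livestock), 0.017 (corn), 0.052 (interest), 0.001 (acq), and 0.002 (earn), so “interest” sits slightly above 0.05 while the others are well below it.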

From the comparison with error-minimizing classifiers, costing proved its effectiveness in that it requires a relatively small amount of training data for better performance. For the “corn” category, LR with costing, which used only 10% of the training data (i.e., 986 out of 9854 documents), showed the best result in terms of the smallest loss (1333 out of 79084), zero false positives, and a low false negative rate (0.039). The LR classifier was trained on a sample set from three rejection sampling trials, comprising 458 positive and 528 negative examples. The smallest loss implies that it is expected to pay about 1.7% of the total loss caused by incorrect confidential access control (i.e., misclassification). From the false positive perspective (zero false alarms), one does not need to worry at all about the leaking of confidential information. A 3.9% false negative rate means that 39 out of 1000 valid requests for the confidential information would be mistakenly rejected. This inconveniences employees because they have to access particular information for their projects, but the system does not authorize them. This trend holds for the remaining four categories.

The primary reason that costing is effective is its ability to generate a sample comprised of a nearly even number of positive and negative examples. For example, each resampled set for the “interest” category has only about 500 documents (actually ranging from 502 to 555). About 55% to 60% of the documents in each set are positive, even though in the original dataset the proportion was only 3.6% (513 out of 14,128). Moreover, it takes relatively less time to train the classifier with costing because the sample set is far smaller than the original training set. However, this property hindered the performance of SVM because its performance is sensitive to a skewed class distribution, even with regularization (i.e., assigning a higher cost factor, $C = 100$). In other words, it is difficult for an SVM to find the optimal hyperplane separating two classes if the size of one class is relatively smaller than the other. The results in Table 2 confirm this hypothesis in that the more training examples are available for SVM without wrappers, the less cost it pays (e.g., from 32% (13967 out of 42625) for the “livestock” category to 10% (108108 out of 1090498) for the “earn” category).

Another reason for the good performance, we believe, is that our method for assigning cost distinguished positive from negative examples well. By assigning at least 100 times more cost to positive examples, it helps costing choose the more important examples as samples.

Metacost did not show good performance, possibly because of overfitting caused by random resampling with replacement. To avoid such overfitting, one might consider resampling without replacement, where an instance, $x$, is drawn from a distribution and the next sample is drawn from the set $S - \{x\}$. However, this approach also fails because it keeps shrinking the original distribution and eventually there is nothing left to be chosen.


5  Related  Work 

As many data mining techniques have been applied to various real-world application domains, the usefulness of cost-sensitive learning has drawn attention from the public.

Lee and his colleagues [7] introduced a cost-sensitive framework for the intrusion detection domain and analyzed cost factors in detail. In particular, they identified the major cost factors (e.g., costs for development, operation, damages, and responding to intrusion) and then applied a rule induction learning technique (i.e., RIPPER) to this cost model in order to maximize security while minimizing costs. However, their cost model must be changed manually if a system’s cost factors change.

Maloof [8] utilized two sampling methods, under- and over-sampling, for efficient learning from skewed data sets. Under- and over-sampling are stratification techniques that generate a set of samples from the original data according to certain criteria [2]. In under-sampling, all instances of the class $j$ with the highest $P'(j)$ are retained, and a fraction $P'(i)/P'(j)$ of the examples of each other class $i$ is chosen at random for inclusion in the resampled training set, where $P'(j) = C(j)P(j) / \sum_{k} C(k)P(k)$ and $C(j) = C(i|j)$; that is, $C(j)$ is the cost of misclassifying an instance of class $j$, irrespective of the class predicted. In over-sampling, on the contrary, all examples of the class $j$ with the lowest $P'(j)$ are retained, and the examples of every other class $i$ are duplicated approximately $P'(i)/P'(j)$ times in the training set. Both sampling methods have drawbacks: under-sampling reduces the amount of data available, which might increase the cost, whereas over-sampling avoids the loss of training data but may significantly increase learning time. Moreover, these techniques are only applicable when there is a slight difference in the class frequency. In our case, these two techniques resulted in the following distribution of the five selected topics. Under-sampling: livestock (114 → 0.35), corn (253 → 3.23), interest (513 → 5.46), acq (2448 → 1278.68), earn (3987 → 3987). Over-sampling: livestock (114 → 9844.61), corn (253 → 89809), interest (513 → 617137), acq (2448 → 35452826), earn (3987 → 110543424). Since too few examples are available for training with under-sampling whereas there are too many examples to process computationally with over-sampling, we did not apply those two methods to our scenario.
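The two stratification rules can be written out as a small sketch. It follows the definitions of $P'(j)$ given above and applies the stated fractions literally; the class counts and costs in the toy run are hypothetical, and the sketch does not attempt to reproduce the figures reported for the five topics, since the per-class costs used there are not stated.

```python
def adjusted_priors(counts, costs):
    """P'(j) = C(j) P(j) / sum_k C(k) P(k), with P(j) the class frequency."""
    total = sum(counts.values())
    weighted = {c: costs[c] * counts[c] / total for c in counts}
    z = sum(weighted.values())
    return {c: w / z for c, w in weighted.items()}


def under_sample_sizes(counts, costs):
    """Keep the class with the highest P'; keep a fraction P'(i)/P'(j) of the others."""
    p = adjusted_priors(counts, costs)
    top = max(p, key=p.get)
    return {c: counts[c] if c == top else counts[c] * p[c] / p[top]
            for c in counts}


def over_sample_sizes(counts, costs):
    """Keep the class with the lowest P'; duplicate the others about P'(i)/P'(j) times."""
    p = adjusted_priors(counts, costs)
    low = min(p, key=p.get)
    return {c: counts[c] if c == low else counts[c] * p[c] / p[low]
            for c in counts}


# Toy run (hypothetical numbers): 10 positives costing 18 each, 90 negatives costing 1.
toy_counts = {"pos": 10, "neg": 90}
toy_class_costs = {"pos": 18.0, "neg": 1.0}
priors = adjusted_priors(toy_counts, toy_class_costs)
under = under_sample_sizes(toy_counts, toy_class_costs)
over = over_sample_sizes(toy_counts, toy_class_costs)
```

In the toy run the adjusted priors are 2/3 and 1/3, so under-sampling keeps all 10 positives and about 45 negatives, while over-sampling keeps the 90 negatives and grows the positives to about 20.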

Fan and his colleagues [16] proposed a new method called “AdaCost” for reducing misclassification cost using boosting. In particular, the idea is to treat examples unequally according to their cost while learning rules, by assigning high initial weights to costly examples and by taking cost into account in the weight-update rule. Their approach is similar to costing in that the performance is improved by averaging (i.e., weighted bagging vs. bagging). Their finding in the comparison of AdaCost with AdaBoost on the “Chase credit card data” is quite similar to ours: AdaBoost reduces misclassification error significantly but does not reduce the cost for misclassifying.


6  Conclusion  and  Future  Work 

In this paper we test the effectiveness of cost-sensitive learning for confidential access control. The goal of this work is to develop a reliable process for confidential access control based on the need-to-know principle: a request for access to a unit of confidential information is accepted only if the content of the requested item is relevant to the requester’s task. We model such a dichotomous decision (i.e., to reject or accept the request) in a machine learning framework. A false positive occurs when the system accepts a request that should not have been accepted, whereas a false negative occurs when the system rejects a request that should have been accepted. For both errors, the system pays the cost for misclassification. From the security perspective, a false positive is more expensive than a false negative because the former is a serious security problem: confidential information that should not be revealed can be accessed.

In order to achieve our goal, we need to find a classifier that minimizes the cost and the false positive rate while holding the false negative rate reasonably low. We utilized two wrappers for cost-sensitive learning because the best-performing error-minimizing classifiers do not take unequal misclassification costs into account. From the comparison of the cost-sensitive learners with the error-minimizing classifiers, we found that costing showed the best performance. In particular, it requires far less training data for much better results in terms of the smallest cost paid, the lowest false positive rate, and a lower false negative rate. The benefit of smaller training data is two-fold: first, it obviously takes less time to train the classifier; second, it enables a human administrator to conveniently identify arbitrary subsets of confidential information in order to train the initial classifier.

Since we found our metric for cost assignment useful, we would like to generalize this idea as future work. In this work, we primarily focused on testing this framework for the text domain. We would like to investigate the usefulness of this approach for different types of media, such as images and video. Although, to our knowledge, the machine learning approach is a novel one for access control, it would be very interesting to compare the effectiveness of our framework with conventional document management systems (e.g., ACL-based systems).
Figure 3: A pair of false positive (filled bar) and false negative (empty bar) for the “corn” category is presented.


Figure 4: A pair of false positive (filled bar) and false negative (empty bar) for the “interest” category is presented.

Figure 5: A pair of false positive (filled bar) and false negative (empty bar) for the “acq” category is presented.


Figure 6: A pair of false positive (filled bar) and false negative (empty bar) for the “earn” category is presented.